1. Introduction

In this exploratory data analysis, our group examines Dengue Fever cases in the city of Mérida, Mexico, using data extracted from _______. Dengue Fever is a mosquito-borne viral infection caused by the dengue virus, which continues to pose a significant public health concern in tropical and subtropical regions.

Some of Dengue Fever’s most common symptoms include sudden drops in blood pressure, nausea, and muscle pain, similar to COVID-19 symptoms. Researchers have found that Merida, which is the capital of the state of Yucatán, experiences recurrent dengue outbreaks due to having a tropical climate, seasonal rainfall and standing water, and an urban population density.

Our group was initially interested in exploring public health-related datasets, due to our shared experiences of living throughout the COVID-19 pandemic. We noticed how COVID-19 had similar symptoms to Dengue Fever and had similar effects in profoundly impacting communities worldwide, highlighting how infectious diseases can spread rapidly and sporadically across different regions. By being able to analyze a virus like Dengue Fever, our group can examine how a viral disease can be influenced by environmental and geographic factors.

To elaborate more on the dataset, it is comprised of 540 rows, which are all distinctly different regions within Merida. Each region is quantified based on their coordinates from the Universal Transverse Mercator (UTM) system, a grid-based coordinate system used to map locations using meters. The dataset initially had 8 columns: the X and Y UTM coordinates of each region, SP_ID (an identifier for each region), CVEGEO (geographic identifier code used by the Mexican Geographic Agency), and the number of cases in each region across four different years (2012-2015). Later mentioned in the Data Cleaning step, SP_ID and CVEGEO are not necessary for our overall analysis.

2. Project Setup, Data Loading, and Data Cleaning

Before developing any visualizations or analytical models, we prepared the R environment by loading our primary dataset, “Merida_Den12_13_14_15.csv,” along with a complete suite of R packages. We used the tidyverse package, which provides core tools like dplyr for data transformation and ggplot2 for static visualization. For our more advanced interactive plots, we loaded plotly (used for the animated density map) and leaflet (used for the interactive Year-over-Year map). We also included sf to read and handle the simple features (geospatial) data for Merida. Once all libraries were loaded, we imported the dataset into a data frame. The resulting data frame contained 540 rows and 8 columns, representing spatial identifiers, coordinate values, and annual Dengue case counts. This formed the foundation for our exploratory analysis.

Data Cleaning is a crucial step to ensure the integrity and relevance of the data prior to our analysis. We want to ensure that all the data required for this project is clean, validated, and standardized so that our group can create data visualizations using the shared data.

The raw dataset contains several columns, including SP_ID and CVEGEO, which serve as identifiers for position and case record. After discussing the goals for the project, we determined that our exploratory analysis is focused exclusively on the relationship between spatial coordinates and case counts. Therefore, the important features for this analysis are the X and Y coordinates and the annual case count columns (Den2012 through Den2015). The SP_ID and CVEGEOCVEGEO columns, while potentially useful for future work like joining with demographic or climate data, are not required for our current visualization and modeling goals. Therefore, we dropped the project’s irrelevant columns, creating a standard data frame named data_cleaned.

Next, we checked the dataframe for missing data because null values can introduce significant bias, propagate errors in calculations, and cause visualization failures. The R console output confirms that all columns returned 0 null values. This finding validates the completeness of our dataset, allowing us to proceed with our analysis without further cleaning procedures, ensuring our results are based on the full set of observations.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(leaflet) 
library(sf) 
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(htmltools)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
data <- read.csv('Merida_Den12_13_14_15.csv')

head(data)
##          X       Y SP_ID        CVEGEO Den2012 Den2013 Den2014 Den2015
## 1 231639.6 2317510     0 3104100010070      17       6       2       2
## 2 231323.1 2316622     1 3104100010085      35      15      10      18
## 3 230769.7 2316113     2 310410001009A      35      14       8      17
## 4 233250.8 2316208     3 3104100010136      40       8       6      10
## 5 233266.3 2317442     4 3104100010140      12       8       4       4
## 6 232736.6 2317238     5 3104100010155      30       8       1       5
data_cleaned <- data %>%
  select(X, Y, starts_with("Den"))
na_counts <- colSums(is.na(data_cleaned))
print(na_counts)
##       X       Y Den2012 Den2013 Den2014 Den2015 
##       0       0       0       0       0       0
head(data_cleaned)
##          X       Y Den2012 Den2013 Den2014 Den2015
## 1 231639.6 2317510      17       6       2       2
## 2 231323.1 2316622      35      15      10      18
## 3 230769.7 2316113      35      14       8      17
## 4 233250.8 2316208      40       8       6      10
## 5 233266.3 2317442      12       8       4       4
## 6 232736.6 2317238      30       8       1       5

4. Exploratory Analysis 2: Static Geospatial Patterns - can rename

#maybe overlay the map of the location
# This plot shows the geospatial location of Dengue incidence in Merida, Mexico. We can potentially aesthetically improve this plot later using QGIS or ArcGIS.
plot(data$X, data$Y, 
     main = "Spatial Locations of Sampling Points",
     xlab = "X Coordinate", 
     ylab = "Y Coordinate", 
     pch = 19, col = "blue")

library(ggplot2)
# 1️ Read file
library(sf)

town_map <- st_read("Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp")
## Reading layer `Merida_Den12_13_14_15' from data source 
##   `/Users/amanshaik/Documents/GitHub/dengueAnalysisTechWriting/Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 540 features and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 212287.2 ymin: 2309495 xmax: 236386 ymax: 2333181
## Projected CRS: Mexico ITRF2008 / UTM zone 16N
st_crs(town_map)
## Coordinate Reference System:
##   User input: Mexico ITRF2008 / UTM zone 16N 
##   wkt:
## PROJCRS["Mexico ITRF2008 / UTM zone 16N",
##     BASEGEOGCRS["Mexico ITRF2008",
##         DATUM["Mexico ITRF2008",
##             ELLIPSOID["GRS 1980",6378137,298.257222101,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",6365]],
##     CONVERSION["UTM zone 16N",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",0,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",-87,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",0.9996,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",500000,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",0,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["Engineering survey, topographic mapping."],
##         AREA["Mexico east of 90°W, onshore and offshore."],
##         BBOX[17.81,-90,25.77,-84.64]],
##     ID["EPSG",6371]]
disease_data <- read.csv('Merida_Den12_13_14_15.csv')

# make sure CRS is same, convert to sf
disease_sf <- st_as_sf(disease_data, coords = c("X", "Y"), crs = st_crs(town_map))
names(disease_data)
## [1] "X"       "Y"       "SP_ID"   "CVEGEO"  "Den2012" "Den2013" "Den2014"
## [8] "Den2015"
# 4 generate plot
ggplot() +
  geom_sf(data = town_map, fill = "gray95", color = "black", linewidth = 0.3) +
  geom_sf(data = disease_sf, aes(color = Den2012, size = Den2012)) +
  scale_color_viridis_c(option = "plasma") +
  theme_minimal() +
  labs(title = "Spatial Locations",
       color = "Incidence",
       size = "Incidence")

6. Exploratory Analysis 4: Interactive Case Count Map

# 1. Convert data to sf and transform to WGS84 for Leaflet
# We use EPSG:6371 as identified in the original Section 7
data_sf <- st_as_sf(data, coords = c("X", "Y"), crs = 6371)
data_sf_wgs84 <- st_transform(data_sf, crs = 4326)

# 2. Load and transform the neighborhood shapefile
town_map_raw <- st_read("Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp")
## Reading layer `Merida_Den12_13_14_15' from data source 
##   `/Users/amanshaik/Documents/GitHub/dengueAnalysisTechWriting/Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 540 features and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 212287.2 ymin: 2309495 xmax: 236386 ymax: 2333181
## Projected CRS: Mexico ITRF2008 / UTM zone 16N
town_map_wgs84 <- st_transform(town_map_raw, crs = 4326)

# 3. Create a sequential color palette for absolute cases
# We find the global max to keep the scale consistent across years
all_cases <- c(data_sf_wgs84$Den2012, data_sf_wgs84$Den2013, data_sf_wgs84$Den2014, data_sf_wgs84$Den2015)
global_max_cases <- max(all_cases, na.rm = TRUE)
pal_cases <- colorNumeric(
  palette = "Reds",
  domain = c(0, global_max_cases),
  na.color = "#a0a0a0"
)

# 4. Create popups for each year
popup_2012 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2012 Cases: %d", SP_ID, Den2012), HTML)
popup_2013 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2013 Cases: %d", SP_ID, Den2013), HTML)
popup_2014 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2014 Cases: %d", SP_ID, Den2014), HTML)
popup_2015 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2015 Cases: %d", SP_ID, Den2015), HTML)

# 5. Create the interactive leaflet map
m_cases <- leaflet(data_sf_wgs84) %>%
  # Add base map layers
  addProviderTiles(providers$CartoDB.Positron, group = "Light Map") %>%
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") %>%
  
  # Add Neighborhoods layer
  addPolygons(
    data = town_map_wgs84,
    group = "Neighborhoods",
    fillOpacity = 0.1,
    color = "#444",
    weight = 1,
    opacity = 0.7,
    popup = ~SP_ID
  ) %>%
  
  # Set the map to focus on the data
  fitBounds(
    ~as.numeric(st_bbox(geometry)[1]),
    ~as.numeric(st_bbox(geometry)[2]),
    ~as.numeric(st_bbox(geometry)[3]),
    ~as.numeric(st_bbox(geometry)[4])
  ) %>%

  # ==== Layer 1: 2012 Cases ====
  addCircleMarkers(
    group = "2012 Cases",
    color = ~pal_cases(Den2012),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2012,
    label = ~as.character(Den2012)
  ) %>%
  
  # ==== Layer 2: 2013 Cases ====
  addCircleMarkers(
    group = "2013 Cases",
    color = ~pal_cases(Den2013),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2013,
    label = ~as.character(Den2013)
  ) %>%
  
  # ==== Layer 3: 2014 Cases ====
  addCircleMarkers(
    group = "2014 Cases",
    color = ~pal_cases(Den2014),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2014,
    label = ~as.character(Den2014)
  ) %>%
  
  # ==== Layer 4: 2015 Cases ====
  addCircleMarkers(
    group = "2015 Cases",
    color = ~pal_cases(Den2015),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2015,
    label = ~as.character(Den2015)
  ) %>%
  
  # ==== Legend ====
  addLegend(
    pal = pal_cases,
    values = c(0, global_max_cases),
    title = "Case Count",
    position = "bottomright",
    opacity = 1
  ) %>%

  # ==== Layer Control ====
  addLayersControl(
    baseGroups = c("2012 Cases", "2013 Cases", "2014 Cases", "2015 Cases"),
    overlayGroups = c("Light Map", "Satellite", "Neighborhoods"),
    options = layersControlOptions(collapsed = FALSE)
  )

# Display the map
m_cases

7. Exploratory Analysis 5: Analyzing Year-over-Year (YoY) Change - can rename

# 1. Calculate Year-over-Year (YoY) percentage change
# We must handle cases where the denominator (previous year) is 0.
calculate_yoy <- function(current, previous) {
  case_when(
    previous == 0 & current == 0 ~ 0,  # 0 to 0 is 0% change
    previous == 0 & current > 0 ~ Inf, # 0 to N is infinite % change
    previous > 0 ~ (current - previous) / previous
  )
}

data_yoy <- data %>%
  mutate(
    YoY_2013 = calculate_yoy(Den2013, Den2012),
    YoY_2014 = calculate_yoy(Den2014, Den2013),
    YoY_2015 = calculate_yoy(Den2015, Den2014)
  )

# 2. Clean up Inf/-Inf/NaN values for visualization. We'll set Inf to NA.
data_yoy <- data_yoy %>%
  mutate(across(starts_with("YoY_"), ~if_else(is.finite(.), ., NA_real_)))

# 3. Convert to a geospatial 'sf' object and transform coordinates
# The .prj file (EPSG:6371) tells us this is Mexico ITRF2008 / UTM zone 16N.
# Leaflet requires standard WGS84 (EPSG:4326) coordinates (lat/lon).
data_sf <- st_as_sf(data_yoy, coords = c("X", "Y"), crs = 6371)
data_sf_wgs84 <- st_transform(data_sf, crs = 4326)

# 4. Define a diverging color palette
# We cap the domain from -1 (-100%) to 2 (200%) to handle outliers
cap_range <- c(-1, 2)
pal_yoy <- colorNumeric(
  palette = "RdYlBu", 
  domain = cap_range, 
  na.color = "#a0a0a0",
  reverse = TRUE # Make red positive (increase), blue negative (decrease)
)

# 5. Create formatted popup and label content
# Popups appear on click
popup_2013 <- ~lapply(sprintf(
  "<strong>Location ID: %s</strong><br/>2012 Cases: %d<br/>2013 Cases: %d<br/>YoY Change: %s",
  SP_ID, Den2012, Den2013, scales::percent(YoY_2013, accuracy = 0.1)
), HTML)

popup_2014 <- ~lapply(sprintf(
  "<strong>Location ID: %s</strong><br/>2013 Cases: %d<br/>2014 Cases: %d<br/>YoY Change: %s",
  SP_ID, Den2013, Den2014, scales::percent(YoY_2014, accuracy = 0.1)
), HTML)

popup_2015 <- ~lapply(sprintf(
  "<strong>Location ID: %s</strong><br/>2014 Cases: %d<br/>2015 Cases: %d<br/>YoY Change: %s",
  SP_ID, Den2014, Den2015, scales::percent(YoY_2015, accuracy = 0.1)
), HTML)

# Labels appear on hover
label_2013 <- ~scales::percent(YoY_2013, accuracy = 0.1)
label_2014 <- ~scales::percent(YoY_2014, accuracy = 0.1)
label_2015 <- ~scales::percent(YoY_2015, accuracy = 0.1)
# Load and transform the neighborhood shapefile
town_map_raw <- st_read("Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp")
## Reading layer `Merida_Den12_13_14_15' from data source 
##   `/Users/amanshaik/Documents/GitHub/dengueAnalysisTechWriting/Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 540 features and 6 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 212287.2 ymin: 2309495 xmax: 236386 ymax: 2333181
## Projected CRS: Mexico ITRF2008 / UTM zone 16N
town_map_wgs84 <- st_transform(town_map_raw, crs = 4326)

# Create the interactive leaflet map
m <- leaflet(data_sf_wgs84) %>%
  # Add base map layers
  addProviderTiles(providers$CartoDB.Positron, group = "Light Map") %>%
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") %>%
  
  # Add Neighborhoods layer
  addPolygons(
    data = town_map_wgs84,
    group = "Neighborhoods",
    fillOpacity = 0.1,
    color = "#444",
    weight = 1,
    opacity = 0.7,
    popup = ~~SP_ID
  ) %>%
  
  # Set the map to focus on the data
  fitBounds(
    ~as.numeric(st_bbox(geometry)[1]),
    ~as.numeric(st_bbox(geometry)[2]),
    ~as.numeric(st_bbox(geometry)[3]),
    ~as.numeric(st_bbox(geometry)[4])
  ) %>%

  # ==== Layer 1: 2013 vs 2012 ====
  addCircleMarkers(
    group = "2013 vs 2012",
    color = ~pal_yoy(YoY_2013),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2013,
    label = label_2013
  ) %>%
  
  # ==== Layer 2: 2014 vs 2013 ====
  addCircleMarkers(
    group = "2014 vs 2013",
    color = ~pal_yoy(YoY_2014),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2014,
    label = label_2014
  ) %>%

  # ==== Layer 3: 2015 vs 2014 ====
  addCircleMarkers(
    group = "2015 vs 2014",
    color = ~pal_yoy(YoY_2015),
    fillOpacity = 0.8,
    stroke = FALSE,
    radius = 5,
    popup = popup_2015,
    label = label_2015
  ) %>%
  
  # ==== Legend ====
  addLegend(
    pal = pal_yoy,
    values = cap_range,
    title = "YoY % Change",
    position = "bottomright",
    labFormat = labelFormat(suffix = "%", transform = function(x) x * 100),
    opacity = 1
  ) %>%

  # ==== Layer Control ====
  addLayersControl(
    baseGroups = c("2013 vs 2012", "2014 vs 2013", "2015 vs 2014"),
    overlayGroups = c("Light Map", "Satellite", "Neighborhoods"),
    options = layersControlOptions(collapsed = FALSE)
  )
## Warning in pal_yoy(YoY_2013): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2013): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2014): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2014): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2015): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2015): Some values were outside the color scale and will
## be treated as NA
# Display the map
m

8. Conclusion and Future Questions